11/11/2022

glmmTMB

  • Generalized linear mixed models (GLMMs) can be useful for non-normal data with random effects
  • \(\texttt{glmmTMB}\) has a familiar interface based off \(\texttt{lme4}\)
  • \(\texttt{glmmTMB}\) is a fast, flexible and stable package (Brooks et al. 2017)
  • It has more distributions available
  • Plus flexible zero-inflated models and hurdle models

House work

You can download the folder directly from github:

Or clone

git clone https://github.com/mmcgillycuddy/rr_tutorial.git

House work

First, let’s load some packages

packages <- c("ggplot2", "dplyr", "tidyr", "glmmTMB", "ecostats")
install.packages(setdiff(packages, rownames(installed.packages()))) 
invisible(lapply(packages, library, character.only = TRUE))
## Warning: package 'ecostats' was built under R version 4.2.2
source("functions.R")

Wind Farm Example

  • A study in Sweden investigated the effects of an offshore windfarm on the abundance and distribution patterns of benthic fish communities

Wind Farm Example: Why GLMM?

Generalised Linear Mixed Model

A generalised linear mixed model (GLMM) has the following form:

\[ g(\boldsymbol{\mu}) = \boldsymbol{ X \beta} + \boldsymbol{Zb} \] where

  • g(.) is a known link function
  • \(\boldsymbol{X}\) is a \(n \times p\) model matrix
  • \(\boldsymbol{\beta}\) is a \(p\)-dimensional vector of regression coefficients related to the covariates
  • \(\boldsymbol{Z}\) is a \(n \times q\) model matrix for the \(q\)-dimensional random effects
  • \(\boldsymbol{b}\) are multivariate random effects, \(\boldsymbol{b} \sim N(\boldsymbol{0}, \boldsymbol{\Sigma} )\)
  • \(\boldsymbol{y} | \boldsymbol{b} \sim F(\mu, \phi )\), where F is a member of the exponential family, mean \(\boldsymbol{\mu}\), nuisance \(\boldsymbol{\phi}\)

Generalised Linear Mixed Model

Potential reasons why a GLMM is needed:

  • data has clusters of non-independent observations that are hierarchical
  • only a random sample from a large number of potential levels of a factor
  • the effect is a nuisance variable, i.e. not an interest but we should control for it

GLMM: Wind farm example

Consider the joint model of the mean abundance, \(\mu_{ijk}\), observed at \(i = 1, \ldots, n\) observation, \(k = 1,\ldots, l\) stations, for \(j = 1, \ldots, p\) species as follows:

\[ g(\mu_{ijk}) = \boldsymbol{x_i'\beta} + \boldsymbol{x_i'b_{j}^{[x]}} + {b_{k}^{[s]}} + b_{ij} \]

  • \(\boldsymbol{x_i}\) are vectors of covariates relating to intercept, zone, year and the interaction of zone and year
  • \(\boldsymbol{\beta}\) are vectors of fixed coefficients
  • \(\boldsymbol{b_{j}^{[x]}} \sim N(\boldsymbol{0}, \boldsymbol{\sigma_x^2I})\)
  • \({b_{k}^{[s]}} \sim N(0, \sigma^2_s)\)
  • \(\boldsymbol{b_{i}} = (b_{i1},\ldots,b_{ip}) \sim N(\boldsymbol{0}, \Sigma_{sp})\)

GLMM: Wind farm example

The wind farm data is available in the ecostats package.

wf <- cbind(windFarms$abund, windFarms$X) %>% 
  mutate(ID = 1:nrow(.)) %>% 
  pivot_longer(cols = colnames(windFarms$abund), 
               names_to = "Species",
               values_to = "Abundance")

Wind Farm Example: Data

head(wf)
## # A tibble: 6 × 7
##   Year  Zone  Station Impact    ID Species     Abundance
##   <fct> <fct> <fct>   <fct>  <int> <chr>           <int>
## 1 2003  WF    8.1     Before     1 Bergvar             0
## 2 2003  WF    8.1     Before     1 Oxsimpa             0
## 3 2003  WF    8.1     Before     1 Piggvar             0
## 4 2003  WF    8.1     Before     1 Roodspotta          0
## 5 2003  WF    8.1     Before     1 Rootsimpa           0
## 6 2003  WF    8.1     Before     1 Sandskaadda         0

If you want more information you can find it in the help file:

?ecostats::windFarms

Wind Farm Example: Data

  • Keep stations that were observed in both years
  • Remove species that were observed less than 5 times
stat03 <- unique(wf[wf$Year=="2003",]$Station)
stat10 <- unique(wf[wf$Year=="2010",]$Station)
sum_abund <- wf %>%
  filter(Station %in% stat03 ) %>%
  group_by(Species) %>%
  summarise(obs = sum(Abundance > 0) ) %>% 
  ungroup() %>% 
  arrange(-obs) # order from most to least abundant

spp_names <- unique(sum_abund$Species[sum_abund$obs>=5])
wf_data <- wf %>% 
  filter(Species %in% spp_names) %>% 
  mutate(Species = factor(Species, levels = spp_names))

Wind Farm Example: Data

## Boxplot of abundance of species by year and zone
labs=c(0,1,10,100)
wf_boxplot <- wf_data %>% 
  ggplot(aes(x = Year, y = log(Abundance + 1), fill = Zone)) +
  geom_boxplot(outlier.shape = 21)+
  facet_wrap(~ Species, ncol = 3, scales = "free") +
  labs(x = "Year", color  = "Zone", y = "Abundance [log(y+1)-scale]") +
  scale_y_continuous( breaks = log(labs + 1), labels = labs)+
  theme_classic() +
  theme(legend.position="bottom"  ) +
  scale_fill_brewer(palette = "Set2")

Wind Farm Example: Data

wf_boxplot

GLMM: wind farm

We assume no prior structure about the correlation across species, i.e., unstructured covariance matrix.

fit.us <- glmmTMB(Abundance ~ Year*Zone + diag(Year*Zone|Species) + 
                 (1 | Station) +
                 (Species + 0|ID),
               wf_data, family = poisson)
## Warning in fitTMB(TMBStruc): Model convergence
## problem; non-positive-definite Hessian matrix. See
## vignette('troubleshooting')
## Warning in fitTMB(TMBStruc): Model convergence problem;
## false convergence (8). See vignette('troubleshooting')

GLMM: reduced-rank random effects

A reduced-rank approach to fit a multivariate random effect

\[b_{ij} = \boldsymbol{\lambda}_j \boldsymbol{u}_{i}\]

where

  • \(\boldsymbol{u_i}\) is a vector of \(d\) latent variables
  • \(\boldsymbol{u_i} \sim N(\boldsymbol{0}, \boldsymbol{I})\)
  • \(\boldsymbol{\Lambda} = \left\{\boldsymbol{\lambda}_1^\top, \ldots, \boldsymbol{\lambda}_p\right\}^\top\) are a matrix of factor loadings

This implies \(\boldsymbol{b_i} \sim N(\boldsymbol{0}, \boldsymbol{\Lambda}\boldsymbol{\Lambda'})\)

GLMM: reduced-rank model

wf.glmm <- glmmTMB(Abundance ~ Zone*Year + diag(Zone*Year|Species) +
                     (1|Station) + 
                     rr(Species + 0|ID, 2), 
                   family = "poisson", data = wf_data,
                   control = glmmTMBControl(start_method = list(method = "res")))
  • The number after the comma in the rr formula specifies the rank of the variance-covariance matrix of the multivariate random effects.

  • The start_method argument in glmmTMBControl() controls the algorithm used for initialising the rr parameters (\(\boldsymbol{\lambda}_j\) and \(\boldsymbol{u}_{i}\))

GLMM: summary

summary(wf.glmm)
##  Family: poisson  ( log )
## Formula:          
## Abundance ~ Zone * Year + diag(Zone * Year | Species) + (1 |  
##     Station) + rr(Species + 0 | ID, 2)
## Data: wf_data
## 
##      AIC      BIC   logLik deviance df.resid 
##   2914.1   3075.6  -1427.1   2854.1     1581 
## 
## Random effects:
## 
## Conditional model:
##  Groups  Name                 Variance Std.Dev. Corr       
##  Species (Intercept)          3.867106 1.96650             
##          ZoneN                0.907514 0.95264  0.00       
##          ZoneS                4.070893 2.01765  0.00  0.00 
##          Year2010             1.676828 1.29492  0.00  0.00 
##          ZoneN:Year2010       0.837704 0.91526  0.00  0.00 
##          ZoneS:Year2010       0.330834 0.57518  0.00  0.00 
##  Station (Intercept)          0.033221 0.18227             
##  ID      SpeciesTorsk         0.005311 0.07288             
##          SpeciesTanglake      0.214615 0.46327   0.91      
##          SpeciesAL            0.370471 0.60866  -0.45 -0.78
##          SpeciesOxsimpa       0.116793 0.34175  -1.00 -0.89
##          SpeciesStensnultra   0.575215 0.75843   0.58  0.19
##          SpeciesRootsimpa     0.427262 0.65365   0.83  0.53
##          SpeciesSkrubbskaadda 0.592405 0.76968   0.37 -0.06
##          SpeciesSjurygg       0.505704 0.71113   0.67  0.30
##          SpeciesSandskaadda   3.149266 1.77462   0.67  0.30
##                                      
##                                      
##                                      
##                                      
##  0.00                                
##  0.00  0.00                          
##  0.00  0.00  0.00                    
##                                      
##                                      
##                                      
##                                      
##   0.41                               
##   0.47 -0.62                         
##   0.12 -0.86  0.93                   
##   0.67 -0.41  0.97  0.82             
##   0.36 -0.70  0.99  0.97  0.94       
##   0.36 -0.71  0.99  0.97  0.93  1.00 
## Number of obs: 1611, groups:  
## Species, 9; Station, 108; ID, 179
## 
## Conditional model:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    -1.95460    0.70374  -2.777  0.00548 **
## ZoneN          -0.15726    0.42626  -0.369  0.71219   
## ZoneS          -0.04281    0.73650  -0.058  0.95365   
## Year2010       -0.14090    0.51263  -0.275  0.78343   
## ZoneN:Year2010 -0.17819    0.46092  -0.387  0.69906   
## ZoneS:Year2010  0.62530    0.39446   1.585  0.11292   
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

GLMM: factor loading

The factor loadings from a model can be obtained from the report

fl <- wf.glmm$obj$env$report(wf.glmm$fit$parfull)$fact_load[[3]]
fl
##              [,1]        [,2]
##  [1,] -0.07287623  0.00000000
##  [2,] -0.42096319 -0.19340382
##  [3,]  0.27233196  0.54434034
##  [4,]  0.34139302 -0.01560407
##  [5,] -0.44261814  0.61587704
##  [6,] -0.54556057  0.36003573
##  [7,] -0.28097141  0.71656152
##  [8,] -0.47609504  0.52824010
##  [9,] -1.19386109  1.31299705

GLMM: ordination plot

We can visualise the correlations between species from the estimated latent variables and factor loadings from a model without predictors, i.e., an unconstrained ordination plot

wf.glmm.unc <- glmmTMB(Abundance ~  Species + rr(Species + 0 | ID, d = 2),
                       family = "poisson", data = wf_data, 
                       control = glmmTMBControl(start_method = list(method = "res")  ))

To get the factor laodings and latent variables for the reduced-rank random effect:

wf.unc.fl.b <- extract.rr.par(wf.glmm.unc)
unc.wf.b <- as.data.frame(wf.unc.fl.b$b)
unc.wf.b$ID <- rownames(unc.wf.b)
unc.wf.fl <-  as.data.frame(wf.unc.fl.b$fl)

GLMM: ordination plot

cnms <- wf.glmm.unc$modelInfo$reTrms[["cond"]]$cnms
labels <- gsub("Species", "", cnms$ID)
names(unc.wf.b)[1:ncol(unc.wf.fl)] <- paste0("LV", 1:ncol(unc.wf.fl))

zone_col <- wf_data %>%
  select(ID, Zone, Year) %>%
  distinct() %>%
  mutate(ID = as.character(ID))

unc.wf.b <- left_join(unc.wf.b, zone_col, by = "ID")

ordi.plot <- ggplot(unc.wf.b) +
  geom_point(aes(x = LV1, y = LV2, color = Zone, shape = Year) , size = 2) +
  geom_text(data = unc.wf.fl, aes(x = V1, y = V2, label = labels),  colour="dark grey") +
  theme_classic() +
  labs(x = "Latent variable 1", y = "Latent variable 2")+
  scale_color_brewer(palette = "Dark2")

GLMM: ordination plot

ordi.plot

GLMM: ordination plot

We can compare the unconstrained ordination plot to a residual ordination plot, i.e., after including zone, year, their interaction and station in the model.

Random slopes model

Random slopes model example: PIRLS

  • Fit a random slopes model with heaps of slopes

  • Literacy of students across the world (PIRLS)

  • We are interested in the effect of the interaction between economic disadvantage and the size of the school library.

  • Allow the interaction to vary by country

Global literacy - PIRLS

Consider the model of overall literacy score, \(y_{ijk}\), for student \(i = 1, \ldots, n\), in school \(j = 1,\ldots, l\) and country \(k = 1, \ldots, p\) as follows:

\[ y_{ijk} = \boldsymbol{x_j'\beta} + {b_{j}^{[s]}} + \boldsymbol{x_j'b_{jk}^{[rr]}} + \epsilon_{ijk} \]

  • \(\boldsymbol{x_j}\) are vectors of covariates relating to school factors
  • \(\boldsymbol{\beta}\) are vectors of fixed coefficients
  • \({b_{j}^{[s]}} \sim N(0, \sigma^2_s)\)
  • \(\boldsymbol{b_{k}^{[rr]}} \sim N(\boldsymbol{0}, \boldsymbol{\Lambda \Lambda'})\) where \(\boldsymbol{\Lambda}\) is the full matrix of factor loadings.
  • \(\epsilon_{ijk} \sim N(0, \sigma^2)\)

PIRLS: reduced-rank model

This may take a little longer to fit

load("data/pirls_data.RData")

fit.rr <- glmmTMB(Overall ~  Size_lib*Eco_disad +
                    (1 |School) +
                    rr(Size_lib*Eco_disad | Country, 3),
                  data = pirls,
                  family = gaussian(),
                  control = glmmTMBControl(optCtrl=list(iter.max=1e5,eval.max=1e3),
                                           start_method = list(method = "res")))

PIRLS: reduced-rank estimates

We can plot the reduced-rank estimates, i.e. \(\boldsymbol{\lambda_j u_k}\)

### names for random effect terms, e.g. (x | group)
cnms <- fit.rr$modelInfo$reTrms[["cond"]]$cnms   ## list of terms, i.e. names of x
flist <- fit.rr$modelInfo$reTrms[["cond"]]$flist ## list of grouping variables
levs <- lapply(flist, levels) # levels of group
# (nb is the number of levels in group)
nb <- vapply(fit.rr$modelInfo$reStruc$condReStruc, function(x) x$blockReps, numeric(1)) ## number of blocks per RE (may != nlevs in some cases)

## Note: b is the name for all latent variables in the model, i.e., all random effects in the model
## u.rr will refer to the latent variables of rr block
## Then the estimates of the rr random effect are b.rr = lambda * u.rr
## lambda is a (nc x d) matrix of factor loadings (nc is the number of levels in x, i.e. school vars)
## where U ~ N(0, 1) and u.rr is (nb x d) matrix
block.rr <- 2
fl <- fit.rr$obj$env$report(fit.rr$fit$parfull)$fact_load[[block.rr]]
rownames(fl) = cnms[[block.rr]]
u.rr <- ranef(fit.rr)$cond[[block.rr]][,1:ncol(fl)]
b.rr <- fl %*% t(u.rr)

PIRLS: reduced-rank estimates

More work and more code is need to get the standard errors, but this is what we can end up with:

PIRLS: random slopes plot

References

Brooks, Mollie E, Kasper Kristensen, Koen J van Benthem, Arni Magnusson, Casper W Berg, Anders Nielsen, Hans J Skaug, Martin Machler, and Benjamin M Bolker. 2017. “glmmTMB Balances Speed and Flexibility Among Packages for Zero-Inflated Generalized Linear Mixed Modeling.” The R Journal 9 (2): 378–400.